Geographic destiny trumps taxonomy in the Roundtail Chub, Gila robusta species complex (Teleostei, Leuciscidae)

The Gila robusta species complex in the lower reaches of the Colorado River includes three nominal and contested species (G. robusta, G. intermedia, and G. nigra) originally defined by morphological and meristic characters. In subsequent investigations, none of these characters proved diagnostic, and species assignments were based on capture location. Two recent studies applied conservation genomics to assess species boundaries and reached contrasting conclusions: an ezRAD phylogenetic study resolved 5 lineages with poor alignment to species categories and proposed a single species with multiple population partitions. In contrast, a dd-RAD coalescent study concluded that the three nominal species are well-supported evolutionarily lineages. Here we developed a draft genome (~ 1.229 Gbp) to apply genome-wide coverage (10,246 SNPs) with nearly range-wide sampling of specimens (G. robusta N = 266, G. intermedia N = 241, and G. nigra N = 117) to resolve this debate. All three nominal species were polyphyletic, whereas 5 of 8 watersheds were monophyletic. AMOVA partitioned 23.1% of genetic variance among nominal species, 30.9% among watersheds, and the Little Colorado River was highly distinct (FST ranged from 0.79 to 0.88 across analyses). Likewise, DAPC identified watersheds as more distinct than species, with the Little Colorado River having 297 fixed nucleotide differences compared to zero fixed differences among the three nominal species. In every analysis, geography explains more of the observed variance than putative taxonomy, and there are no diagnostic molecular or morphological characters to justify species designation. Our analysis reconciles previous work by showing that species identities based on type location are supported by significant divergence, but natural geographic partitions show consistently greater divergence. Thus, our data confirm Gila robusta as a single polytypic species with roughly a dozen highly isolated geographic populations, providing a strong scientific basis for watershed-based future conservation.

In the review process we were asked to remove the most divergent populations from the analysis to show that our conclusions were robust. The problem with post-hoc exclusion of samples in this system is that we can reconstruct support for divergent conclusions simply by excluding certain geographic samples from the study (Supp Fig. 1). We designed the study to sample the full range of geographic and taxonomic variation for which we could obtain samples within this group, and then generated the data and performed the analyses on coded samples so that everyone was blind to the site of origin and nominal taxonomy. Only after generating the data did we reassign the sample identifiers and explicitly test the alternative hypotheses of nominal taxonomy versus geographic structuring. Thus, we are opposed to any post-hoc manipulation of which samples are included in the analyses, because that decision can greatly bias the outcome of the analyses and conclusions drawn from them. For example, exclusion of a handful of sites for high levels of divergence or taxonomic uncertainty of the samples in that watershed can recover support for monophyletic lineages that support preconceived notions about the validity of the nominal taxa, but that are at odds with the underlying distribution (Supp Fig. 1), in the same way that using species names as priors in the DAPC can recover support for those names because morphological, genetic and geographic distinctiveness are confounded in this system.
Given that we were required to redo the analysis in review, we also provide it here for interested readers. Given the genetic divergence of the LCR population is most likely to alter the outcome of the hierarchical AMOVA analysis comparing relative support for the nominal taxonomy compared to the geographic distinctions, we re-ran this analysis to determine if exclusion of LCR changes the outcome. We get essentially the same result from the AMOVA with or without LCR (modified Table 2, below). If anything, this subset of the data provides even less support for the nominal species (11.6 vs 22.8%) than when LCR is included (23.1 vs 30.9%). The same was true of the STRUCTURE analyses which were qualitatively similar and did not recover support for three nominal taxa with or without LCR. Thus, given that the divergence of LCR does not alter the outcome or interpretations of this study, we include all data throughout this manuscript as originally designed and do not exclude any samples from our analyses. Below we provide the code and detailed results of the DAPC for those who wish to repeat, modify or dig further into our analyses.

Prepare environment
Prepare environment by setting working directory, setting random seed for reproducibility, loading required package(s) and data. Since the longest part of this step is the vcfR2genind conversion, I previously saved the genind output object in an Rdata file. I also had to correct some of the mislabelled samples. If running this for the first time, un-comment the single # lines or run them separately.  -factor(gsub("[0-9].*|_[0-9].*", "", ids_vector)) # save(gila, file = "chub_genind.Rdata") load("chub_genind.Rdata")

De novo cluster (k) selection
First, I run find.clusters() (k-means algorithm for DAPC) in the interactive setting to get a plot of BIC ~ number of clusters. This gives me a general idea of what to expect the "optimal K" might be. In this case, I use all principal components (PCs) as there is no reason not to keep all of them at this stage.
### Get initial graph of BIC ~ number of clusters clusters4graph <-find.clusters(gila, pca.select = "nbEig", n.pca = nrow(gila@tab), max.n.clust = 45) ## Choose the number of clusters (>=2: From the graph above, it looks like the minimum/lowest BIC score would be with 23 clusters (after which point there is a generally steady increase in BIC). This is likely to change slightly across different iterations. As it is along more or less of an asymptote, the "min" criterion for cluster selection may not be ideal (lower k would probably suffice). Similarly, there are several points where the BIC goes up (bumps) while it is still not near the minimum, so simple "goesup" selection criteria would not be ideal either. This leaves three other potential criteria to choose the "optimal" number of clusters: "diffNgroup", "smoothNgoesup", and "goodfit". I run k selection using each of these criteria below.
### Get initial graph of BIC ~ number of clusters diffNgroup_clusters <-find.clusters(gila, pca.select = "nbEig", n.pca = nrow(gila@tab), choose.n.clust = F, criterion = "diffNgroup", max.n.clust = 45) smoothNgoesup_clusters <-find.clusters(gila, pca.select = "nbEig", n.pca = nrow(gila@tab), choose.n.clust = F, criterion = "smoothNgoesup", max.n.clust = 45) goodfit_clusters <-find.clusters(gila, pca.select = "nbEig", n.pca = nrow(gila@tab), choose.n.clust = F, criterion = "goodfit", max.n.clust = 45) It is worth noting that the "diffNgroup" method is recommended for users that are "unsure" about the various criteria. This method is also said to possibly be unstable when there are initial, very sharp decreases in the test statistic (BIC here). It may be informative that the high number of clusters chosen by this method essentially corresponds to nearly the number of streams sampled in this study (n=45). While this implies that the genetic variation among our samples can best be explained between these sampling groups, it does not provide much more information about them than we knew a priori. Both the "smoothNgoesup" and "goodfit" agree on k = 12, so I continue with that k as a compromise. I check the membership of each cluster with regards to the streams their samples come from later on while examining the relationships between clusters on the DAPC discriminant functions (DFs) . They do not. I move forward with the clusters made by "goodfit" since they had lower BIC.

DAPC PC selection and main mapping with de novo clusters
Now I run the DAPC analysis. This is said to benefit from using fewer principle components, so similar to the DAPC tutorial mentioned above and as recommended by the "Recommended standard reporting for DAPC analyses" in Box 2 of Miller et al. (2020) (http://www.nature.com/articles/s41437-020-0348-2), we use a-score and the optim.a.score() function on a preliminary DAPC to determine the optimum number of principle components to retain. 58 PCs were retained in the final DAPC, and all DFs were kept for further examination. The proportion of variance explained by the first four PCs is also reported.
## quartz_off_screen ## 2 ## quartz_off_screen ## 2 Closer look at DF1 The first DF strongly separates group 7 from all other clusters. Even zoomed in towards the middle we see that all other clusters are still close together and overlapping. I check whether this large difference on DF1 between cluster 7 and the others is caused by particular loci. The large eigenvalue for DF1 appears to be caused by many loci.
contrib <-loadingplot(dapc$var.contr, axis = 1, thres = 4.7e-4, lab.jitter = 10) While the locus with the largest loading is variable in many clusters (while fixed in some), the locus with the second largest loading is fixed in cluster 7 for an allele opposite to one that is nearly fixed across all other clusters.

Fixed private alleles in clusters
This begs the question as to whether there are fixed differences unique to certain clusters (i.e. fixed private alleles). There are 297 fixed differences unique to cluster 7. The only other cluster with fixed private alleles is cluster 10 which has six. None of these fixed private alleles in cluster 10 were on the same contig.  ,2])==1, x, NA)))))) } total_fixed_diffs <-sapply(cluster_fixed_differences, length) names(total_fixed_diffs) <-paste0("C", seq_along(total_fixed_diffs)) total_fixed_diffs There are some contigs and/or scaffolds with multiple fixed differences unique to cluster 7. The graph below shows the number of contigs that have n cluster 7 fixed private alleles. Although most contigs with a fixed allele unique to cluster 7 contain no more than one such locus, up to five can be found on a single contig.
multivars_cs_section <-pull(fixed_by_section[which(fixed_by_section$n>3), "cs_section"]) ggplot(group_7_fixed, aes(x = cs_section)) + geom_bar(width = 1) + geom_text(stat = "count", aes(label = ifelse(cs_section %in% multivars_cs_section, as.character(cs_section), "")), vjust = -1, size = 3) + scale_y_continuous(limits = c(0, 5.5), expand = c(0,0)) + theme_classic() + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.title.x = element_blank()) It looks like some of these fixed SNPs are quite close together on their contigs (e.g. the four sites on contig 1122), but many are spread out too. I checked these contigs and loci in the annotation, only one coding sequence overlapped and it was in an uncharacterized zebrafish gene). DF2 therefore distinguishes all samples from the two most western watersheds (Bill Williiams River and Agua Fria River) from all other samples. This DF therefore groups clusters containing different species as more similar to each other than they are to all other samples, including other samples of their own purported species. Cluster 10 had some fixed private alleles, so the loci loadings may be interesting to look at for this DF as well. Their appears to be greater variance in the loading of loci in this DF, such that some loci have much larger loadings than others or than average. I take a closer look at the loci with the two largest loadings in the Appendix. These two loci were similar to the two examined with DF1, and neither were cluster 10 fixed private alleles.
contrib <-loadingplot(dapc$var.contr, axis = 2, thres = 0.00125, lab.jitter = 10) DF3 again largely distinguishes clusters 1, 8, and 10 from the others. However, this time cluster 10 is distinguished from all others clusters to one side, while clusters 1 and 8 are separated from all remaining clusters to the other side. DF3 therefore largely differentiates the Agua Fria River watershed from the Bill Williams watershed (the two that were grouped together in DF2).

scatter(dapc, 3,3)
The locus loadings of DF3 appear to be intermediate relative to the previous two DFs. None are as large as the largest two examined with DF2, but there is still quite a large difference between some of the smallest and largest. The two largest are also not any of the fixed private alleles of cluster 10, and are examined further in the Appendix.
contrib <-loadingplot(dapc$var.contr, axis = 3, thres = 0.00095, lab.jitter = 10) DF4 separates cluster 5 and cluster 9 on opposite sides of each other, with all other clusters grouped between them. On the same side of the DF as cluster 5 are clusters 6 and 12. Their distribution on DF4 has little to no overlap with the main group of clusters, even if they are not particularly far away from the main group relative to cluster 5. Cluster 12 (like cluster 6) contains all 21 samples from two locations in the Verde River watershed. Those from WAL are said to be G. intermedia, while those from WTBV are said to be G. robusta.
The eigenvalues start to flatten out more after DF4. Some comments about the other DFs, as well as which clusters they discriminate, are in the first part of the Appendix which covers the membership of clusters not yet discussed.

Main results summary
To recap overall results, 12 clusters were selected by k-means for DAPC analysis. These clusters contained 15-109 individuals from 1-8 streams in 1-4 watersheds. Three clusters contained individuals that would be considered different species based on their sampling locations, with one of those three clusters containing all three putative species.
HSC was the only stream to not group entirely into one cluster: a single sample from there was assigned to cluster 3 rather than with the other 14 samples from its stream in cluster 4. This means that only 1 out of 624 total samples in this study did not cluster with the rest of its stream.

Interpretation
The first part of characterizing genetic variance in this analysis involved using k-means to create de novo clusters. At this stage, samples of multiple putative species were grouped into the same cluster multiple times (as seen in the Tables above that summarize the descriptions of cluster membership in the main analysis and the Appendix).
The DAPC analysis itself mapped clusters onto different axes based on their similarity. The LCR (G. robusta [?]) samples are by far the most distinct of the 624 in this study.
A group of three clusters containing all samples from the Bill Williams River watershed and the Agua Fria River watershed were then grouped together away from all other samples on the second DF (DF2).
This DF2 result contradicts with the three species hypothesis in multiple ways: 1. The three clusters separated from all others on DF2 contains two different putative species: Bill Williams River Gila are said to be G. robusta, while the Agua Fria River Gila are said to be G. intermedia. 2. The clusters in the main DF2 group contain representatives from all three putative species. 3. A simpler explanation of DF2 is that these three separated clusters contain all samples from the two westernmost watersheds sampled in this study.
The third DF then primarily differentiates between these two watershed groups.
The fourth DF produces yet another geographic split in which multiple putative species are lumped together. The majority of samples from the Salt River watershed (cluster 9, putatively G. nigra) are split apart from the majority of samples from the Verde River watershed (cluster 5, putatively G. robusta; cluster 6, putatively G. intermedia; and cluster 12, ~2/3 G. robusta and ~1/3 G. intermedia).
Overall, both the de novo clustering and DAPC analyses consistently group samples of different putative species together, while segregating watersheds and streams much less frequently.